1 Setup

pacman::p_load(tidyverse, gapminder, nycflights13, ggthemes)

1.1 Visualizing (ggplot2)

  • The following material is adapted from Kieran Healy’s excellent book (2019) on data visualization and Hadley Wickham’s equally excellent book on ggplot2. For more theoretical discussions, I recommend you to read The Grammar of Graphics by Leland Wilkinson.

  • Why should we care about data visualization? More precisely, why should we learn the grammar of statistical graphics?

  • Sometimes, pictures are better tools than words in 1) exploring, 2) understanding, and 3) explaining data.

1.1.1 Motivation

Anscombe’s quarter comprises four datasets, which are so alike in terms of their descriptive statistics but quite different when presented graphically.

# Set theme
theme_set(theme_minimal())
# Data
anscombe
# Correlation
cor(anscombe)[c(1:4), c(5:8)]
##            y1         y2         y3         y4
## x1  0.8164205  0.8162365  0.8162867 -0.3140467
## x2  0.8164205  0.8162365  0.8162867 -0.3140467
## x3  0.8164205  0.8162365  0.8162867 -0.3140467
## x4 -0.5290927 -0.7184365 -0.3446610  0.8165214
# gather and select
anscombe_processed <- anscombe %>%
  gather(x_name, x_value, x1:x4) %>%
  gather(y_name, y_value, y1:y4)

# plot
anscombe_processed %>%
  ggplot(aes(x = x_value, y = y_value)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  facet_grid(x_name ~ y_name) +
  theme_bw() +
  labs(
    x = "X values",
    y = "Y values",
    title = "Anscombe's quartet"
  )
## `geom_smooth()` using formula 'y ~ x'

1.1.2 The grammar of graphics

  • the grammar of graphics

    • data
    • aesthetic attributes (color, shape, size)
    • geometric objects (points, lines, bars)
    • stats (summary stats)
    • scales (map values in the data space)
    • coord (data coordinates)
    • facet (facetting specifications)

No worries about new terms. We’re going to learn them by actually plotting.

  • Workflow:

    1. Tidy data
    2. Mapping
    3. Geom
    4. Cor_ordinates and scales
    5. Labels and guides
    6. Themes
    7. Save files

1.1.3 mapping and geom

  • aes (aesthetic mappings or aesthetics) tells which variables (x, y) in your data should be represented by which visual elements (color, shape, size) in the plot.

  • geom_ tells the type of plot you are going to use

1.1.4 basic aes (x , y)

p <- ggplot(
  data = gapminder,
  mapping = aes(x = gdpPercap, y = lifeExp)
) # ggplot or R in general takes positional arguments too. So, you don't need to name data, mapping each time you use ggplot2.

p

p + geom_point()

p + geom_point() + geom_smooth() # geom_smooth has calculated a smoothed line;
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

# the shaded area is the standard error for the line

1.1.5 Univariate distribution

  • geom_histogram(): For the probability distribution of a continuous variable. Bins divide the entire range of values into a series of intervals (see the Wiki entry).
  • geom_density(): Also for the probability distribution of a continuous variable. It calculates a kernel density estimate of the underlying distribution.

1.1.5.1 Histogram

data(midwest) # load midwest dataset

midwest
midwest %>%
  ggplot(aes(x = area)) +
  geom_point() # not working.
midwest %>%
  ggplot(aes(x = area)) +
  geom_histogram() # stat_bin argument picks up 30 bins (or "bucket") by default.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

midwest %>%
  ggplot(aes(x = area)) +
  geom_histogram(bins = 10) # only 10 bins.

ggplot(
  data = subset(midwest, state %in% c("OH", "IN")),
  mapping = aes(x = percollege, fill = state)
) +
  geom_histogram(alpha = 0.7, bins = 20) +
  scale_fill_viridis_d()

1.1.5.2 Density

midwest %>%
  ggplot(aes(x = area, fill = state, color = state)) +
  geom_density(alpha = 0.3) +
  scale_color_viridis_d() +
  scale_fill_viridis_d()

1.1.6 Advanced aes (size, color)

  • There’s also fill argument (mostly used in geom_bar()). Color aes affects the appearance of lines and points, fill is for the filled areas of bars, polygons, and in some cases, the interior of a smoother’s standard error ribbon.

  • The property size/color/fill represents…

ggplot(
  data = gapminder,
  mapping = aes(
    x = gdpPercap, y = lifeExp,
    size = pop
  )
) +
  geom_point()

ggplot(
  data = gapminder,
  mapping = aes(
    x = gdpPercap, y = lifeExp,
    size = pop,
    color = continent
  )
) +
  geom_point() +
  scale_color_viridis_d()

# try red instead of "red"
ggplot(
  data = gapminder,
  mapping = aes(
    x = gdpPercap, y = lifeExp,
    size = pop,
    color = "red"
  )
) +
  geom_point()

Aesthetics also can be mapped per Geom.

p + geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

p + geom_point(alpha = 0.3) + # alpha controls transparency
  geom_smooth(color = "red", se = FALSE, size = 2)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

p + geom_point(alpha = 0.3) + # alpha controls transparency
  geom_smooth(color = "red", se = FALSE, size = 2, method = "lm")
## `geom_smooth()` using formula 'y ~ x'

ggplot(
  data = gapminder,
  mapping = aes(
    x = gdpPercap, y = lifeExp,
    color = continent
  )
) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", color = "red") +
  labs(
    x = "log GDP",
    y = "Life Expectancy",
    title = "A Gapminder Plot",
    subtitle = "Data points are country-years",
    caption = "Source: Gapminder"
  )
## `geom_smooth()` using formula 'y ~ x'

ggplot(
  data = gapminder,
  mapping = aes(
    x = gdpPercap, y = lifeExp,
    color = continent,
    fill = continent
  )
) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", color = "red") +
  labs(
    x = "log GDP",
    y = "Life Expectancy",
    title = "A Gapminder Plot",
    subtitle = "Data points are country-years",
    caption = "Source: Gapminder"
  ) +
  scale_color_viridis_d() +
  scale_fill_viridis_d()
## `geom_smooth()` using formula 'y ~ x'

1.1.7 Co-ordinates and scales

p + geom_point() +
  coord_flip() # coord_type

The data is heavily bunched up against the left side.

p + geom_point() # without scaling

p + geom_point() +
  scale_x_log10() # scales the axis of a plot to a log 10 basis

p + geom_point() +
  geom_smooth(method = "lm") +
  scale_x_log10()
## `geom_smooth()` using formula 'y ~ x'

1.1.8 Labels and guides

scales package has some useful premade formatting functions. You can either load scales or just grab the function you need from the library using scales::

p + geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", color = "red") +
  scale_x_log10(labels = scales::dollar) +
  labs(
    x = "log GDP",
    y = "Life Expectancy",
    title = "A Gapminder Plot",
    subtitle = "Data points are country-years",
    caption = "Source: Gapminder"
  )
## `geom_smooth()` using formula 'y ~ x'

1.1.9 Themes

p + geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", color = "red") +
  scale_x_log10(labels = scales::dollar) +
  labs(
    x = "log GDP",
    y = "Life Expectancy",
    title = "A Gapminder Plot",
    subtitle = "Data points are country-years",
    caption = "Source: Gapminder"
  ) +
  theme_economist()
## `geom_smooth()` using formula 'y ~ x'

1.1.10 ggsave

figure_example <- p + geom_point(alpha = 0.3) +
  geom_smooth(method = "gam", color = "red") +
  scale_x_log10(labels = scales::dollar) +
  labs(
    x = "log GDP",
    y = "Life Expectancy",
    title = "A Gapminder Plot",
    subtitle = "Data points are country-years",
    caption = "Source: Gapminder"
  ) +
  theme_economist()

ggsave(figure_example, here("outputs", "figure_example.png"))